Title of the Invention
EVALUATION AND OPTIMISATION OF CODE

Field of the Invention 

The present invention relates to the evaluation and optimisation of code,
particularly to be used in a processor including a cache.

Background of the Invention 

In the field of computer systems, cache memories and their use are well known.
However, a brief discussion follows in so far as is necessary to fully understand
this invention.

Caches are high-cost, high-speed memories that provide an important 
performance optimisation in processors. This is done by keeping copies of the 
contents of the most commonly used locations of main memory near to the
processor, namely in cache locations. As a result accesses to the contents of 
these memory locations are much quicker. 

The instruction cache is responsible for optimising accesses to the program being
executed. The cache will usually be smaller than the size of the program,
meaning that the contents of the cache will need to change to ensure that the
parts of the program currently being executed are in the cache.

In designing the instruction cache a trade-off between cost and performance has
to be made. Two of the key parameters that can be changed are the cache's size
and associativity. These both influence the resulting silicon area and maximum
clock frequency of the cache. 



The size of a cache is determined by a number of factors, but will depend
primarily on area limitations and target applications of the design.

Determining the appropriate level of associativity of the cache can be harder.

For a direct-mapped cache, each block in main memory maps to a unique
location (line) in the cache. That is, a "block" in memory is a chunk of data
corresponding in size to a cache location. If two blocks map to the same line then
they cannot be in the cache at the same time and will continually replace each
other. This case is referred to as a conflict.

For a set-associative cache, each block maps to a set of lines. The block can be
stored in any of the lines in the set. Note that because the number of lines in the
cache is constant, dividing the cache into sets means that more blocks map to
each set. In general, the cache will be more effective with a reasonable level of
associativity because it can decide which lines it will replace and which lines will
be kept.

However, there are at least two reasons why a direct-mapped cache may be
chosen, namely higher potential clock frequency and smaller area than a
set-associative cache of the same size.

The disadvantage of a direct-mapped instruction cache is that conflicting
addresses can cause large performance loss. As an example consider a real
graphics application in an MPEG decoder. The graphics application includes a
number of different functions, and in particular a variable length decode (VLD)
and an inverse discrete cosine transform (IDCT) function which are used
extremely often and in fact often in sequence on each new data set. That is, it is
almost sure that if one is used, the other will be used subsequently in a short
space of time. If they were to map to the same lines in the cache then there
would be a conflict each time execution moves from one function to the other.



The results of such conflicts are performance losses as the code would have to 
be loaded from memory every time it was needed, and an increase of bus traffic. 

The most common way of ensuring that there are no performance critical conflicts 
is to use a set-associative cache. This reduces the chances of conflicts 
dramatically, as the number of conflicting blocks must be greater than the number 
of lines in the set for the same performance loss to occur.

Another way of reducing the impact of conflicts is to use a victim cache. This will
normally be a small, fully associative cache that stores the last few entries that
have been evicted from the main cache. This can be an effective way of coping
with a small number of conflicts. However, the effectiveness will vary highly
depending on the size of the victim cache and the application being run.

The disadvantage of both of these solutions is that they impose hardware
constraints on the design. The set-associative cache requires more silicon area
and will limit the processor's maximum clock frequency. Using a victim cache
increases the silicon area.

Direct-mapped caches are not very commonly used because conflicts can have
unpredictable and detrimental effects.

It is an aim of the present invention to reduce or eliminate conflicts in a
direct-mapped cache to allow advantage to be taken of the smaller area and higher
clock frequencies characteristic of such caches.

Summary of the Invention 

According to one aspect of the invention there is provided a method of evaluating
a set of memory maps for a program comprising a plurality of functions, the
method comprising: (a) executing a first version of the program according to a first
memory map to generate a program counter trace; (b) converting the program
counter trace into a format defining a memory location in association with a
function and an offset within the function using the first memory map; (c)
translating the program counter trace into physical addresses using one of the set
of memory maps to be evaluated, different from the first memory map; (d)
evaluating the number of likely cache misses using a model of a direct-mapped
cache for that one memory map; and repeating steps (c) and (d) for each of the
memory maps in the set.

Another aspect provides a method of operating a computer to evaluate a set of
memory maps for a program comprising a plurality of functions, the method
comprising: loading a first version of the program into the computer and executing
said first version to generate a program counter trace; loading into the computer a
memory map evaluation tool which carries out the steps of: converting the
program counter trace into a format defining a memory location in association
with a function and an offset within the function using the first memory map;
translating the program counter trace into physical addresses using one of the set
of memory maps to be evaluated, different from the first memory map; and
evaluating the number of likely cache misses using a model of a direct-mapped
cache for that one memory map; wherein the step of translating a program
counter trace and evaluating the number of likely cache misses is repeated for
each of the memory maps in a set to be evaluated.

Another aspect provides a memory map evaluation tool comprising: a first
component operable to generate a program counter trace from execution of a first
version of a program according to a first memory map and to provide from that
program counter trace a converted format defining a memory location in
association with a function and an offset within the function using the first memory
map; and a second component operable to translate the program counter trace
into physical addresses using one of the set of memory maps to be evaluated,
different from the first memory map, and to evaluate the number of likely cache
misses using a model of a direct-mapped cache for that one memory map under
evaluation.



For a better understanding of the present invention and to show how the same
may be carried into effect, reference will now be made by way of example to the
accompanying drawings.

Brief Description of the Drawings 

Figure 1 is a schematic diagram illustrating mapping between a memory
and a direct-mapped cache and a four way set associative cache;

Figure 2 is an example of an MPEG decoder application stored in memory
and its mapping to a cache;

Figure 3 is an example of a memory map;

Figure 4 is a schematic block diagram of a software tool for altering a
memory map to improve cache mapping; and

Figure 5 is a flow chart illustrating operation of the tool of Figure 4.

Description of the Preferred Embodiment 

Figure 1 illustrates the relationship between memory locations and cache lines in
a four way set associative cache and a direct-mapped cache. The main memory
is denoted by reference numeral 2 and is shown to have a plurality of program
blocks. A direct-mapped cache is denoted by reference numeral 4 and is shown with a
plurality of numbered cache lines. Each block maps onto a single cache line only,
with the result that several different blocks all map exclusively onto the same
cache line. Consider for example blocks 1, 513 and 1025 which all map onto line
1 of the cache.

Reference numeral 6 denotes a four way set associative cache from which it can
be seen that each block maps onto a plurality of lines in the cache. In particular
blocks 1, 513 and 1025 all map onto Set 1 but there are four lines to choose from
within the set where the contents of those locations of main memory could be
held.
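
The two mappings of Figure 1 can be sketched as follows. This is only an illustration: the text does not state the number of cache lines or the index function, so the sketch assumes 512 lines and a simple modulo mapping, which are chosen because they reproduce the example of blocks 1, 513 and 1025. The names NUM_LINES, WAYS, direct_mapped_line and set_index are hypothetical.

```python
NUM_LINES = 512  # assumed line count, picked so that blocks 1, 513 and 1025 collide
WAYS = 4         # four way set associative, as in Figure 1


def direct_mapped_line(block, num_lines=NUM_LINES):
    """Direct-mapped: each block maps to exactly one line (assumed modulo mapping)."""
    return block % num_lines


def set_index(block, num_lines=NUM_LINES, ways=WAYS):
    """Set-associative: each block maps to one set; any of its `ways` lines may hold it."""
    return block % (num_lines // ways)


for block in (1, 513, 1025):
    print(block, "line", direct_mapped_line(block), "set", set_index(block))
```

Under these assumptions all three blocks compete for the single line 1 in the direct-mapped case, whereas in the four way case they share Set 1 and can be resident together.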



The potential difficulty with a direct-mapped cache which does not exist in a four
way set associative cache can readily be seen from Figure 1. That is, if block 1 is
in the cache (at line 1) and then block 513 is to be executed, the only location in
the cache suitable for accepting block 513 is line 1, which requires the eviction of
block 1. If block 1 (or indeed block 513) is not often used, this is probably not too
much of a problem. However, in programs where block 513 is often used, and in
particular is often used after block 1, this requires more or less constant cache
eviction and replacement which affects performance and increases bus traffic as
discussed above.
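
The effect of this alternation can be illustrated with a small simulation. It is a sketch under assumptions not given in the text: 512 cache lines, a modulo index, and least-recently-used replacement within each set. The function names are hypothetical.

```python
from collections import OrderedDict


def misses_direct_mapped(trace, num_lines):
    """Count misses when each block can live in only one line."""
    lines = {}
    misses = 0
    for block in trace:
        idx = block % num_lines
        if lines.get(idx) != block:
            misses += 1
            lines[idx] = block  # evict whatever was in the line
    return misses


def misses_set_associative(trace, num_lines, ways):
    """Count misses with `ways` lines per set and LRU replacement (an assumption)."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in trace:
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)  # evict the least recently used block
            s[block] = True
    return misses


trace = [1, 513] * 100  # block 513 repeatedly executed after block 1
print(misses_direct_mapped(trace, 512))       # 200: every access misses
print(misses_set_associative(trace, 512, 4))  # 2: only the first access to each block
```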

Figure 2 is an example of an MPEG decoder application stored in main memory 2
and including a variable length decode function (VLD) and an inverse discrete
cosine transform (IDCT). Assume, as shown by the arrows, that these functions
relate to blocks which map onto the same line or lines in the instruction cache 4.
Due to the frequent usage of these functions within the decoder application, this
would be a situation where a direct-mapped cache would be ineffective.

The software tool discussed in the following, however, allows a direct-mapped
cache to be used in such a situation without a negative impact on performance.

In brief, the tool changes the memory map of a program in order to minimise 
conflicts and hence increase performance. Creating a new memory map simply 
means placing the functions in a new order in memory.

Figure 3 illustrates a program P comprising a plurality of functions labelled
Function 1, Function 2 etc. of differing sizes held in a memory 2. The blocks
labelled 4A, 4B and 4C each represent the full direct-mapped cache and illustrate
the mapping of the program functions in the cache. From this it can be seen that,
for example, Function 1 maps onto the same cache lines as the end part of
Function 3 and the end part of Function 8. Equivalent mappings can be seen
further from the blocks 4A, 4B and 4C in Figure 3. The software tool discussed
herein alters the order of the functions of the program as stored in the memory 2
such that their relative mapping into the cache differs to negate or reduce
conflicts.
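
To illustrate how the order of functions in memory determines which functions share cache lines, the following sketch lays functions out back to back and reports the pairs whose cache lines overlap. The cache size (4096 bytes), line size (32 bytes), function names and function sizes are all assumptions made for the example, and alignment padding is ignored.

```python
from itertools import combinations


def cache_lines(base, size, cache_size=4096, line_size=32):
    """Return the set of direct-mapped cache lines covered by [base, base + size)."""
    num_lines = cache_size // line_size
    first = base // line_size
    last = (base + size - 1) // line_size
    return {blk % num_lines for blk in range(first, last + 1)}


def conflicting_pairs(order, sizes, cache_size=4096, line_size=32, base=0):
    """Place functions contiguously in `order` and list pairs sharing a cache line."""
    lines = {}
    addr = base
    for name in order:
        lines[name] = cache_lines(addr, sizes[name], cache_size, line_size)
        addr += sizes[name]
    return [(a, b) for a, b in combinations(order, 2) if lines[a] & lines[b]]


sizes = {"f1": 1024, "f2": 3072, "f3": 1024}   # hypothetical code sizes in bytes
print(conflicting_pairs(["f1", "f2", "f3"], sizes))  # [('f1', 'f3')]
print(conflicting_pairs(["f1", "f3", "f2"], sizes))  # [('f1', 'f2')]
```

When a program is larger than the cache some overlap is unavoidable, so what matters is whether the overlapping functions are executed close together in time. That cannot be read from the layout alone and needs information about execution order.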

An extremely effective method of optimising the mapping for the instruction cache
relies on the ability to generate traces of the Program Counter (PC) as the
program 3 executes on a typical data set 5. Figure 4 illustrates a memory
mapping tool 6 which works in this way, where the execution is denoted by an
execute block 7, and Figure 5 is a flow diagram.

Initially, a program 3 is compiled (Step S1), its memory map 10 generated (by a
linker at link time - Step S2) and then executed (S3) on a typical data set 5. A
PC trace 8 is produced following this execution.

The trace 8 is converted (S4) to a function/offset format using the first memory
map 10 of the program. For example, if the idct function (see Figure 2) started at
address 0x08003ba0, the address 0x08003ba8 would become idct 0x08. See
Table 1 below.

Table 1

Program Counter Trace    Annotated trace
                         (Format: function offset)

0x080011f4               main 0x50
0x08003ba0               idct 0x00
0x08003ba4               idct 0x04
0x08003ba8               idct 0x08
0x08003bac               idct 0x0c
0x080011f8               main 0x54
0x080011fc               main 0x58
0x080046f8               exit 0x00
0x080046fc               exit 0x04
0x08004700               exit 0x08

The tool 6 uses this trace format to explore new memory maps (labelled Memory
Map 1, Memory Map 2 etc. in Figure 4), looking for one that generates the
minimum number of instruction cache misses. This process of exploration has
the advantage that evaluating each memory map is much quicker than
actually re-linking and benchmarking the program.

Evaluating a memory map (Step S5) is done by translating the function/offset
trace 8 (e.g. "main 0x58") back to physical PC addresses by translator 12 and
passing them through a simple cache model (Step S6). The physical address of
each function is calculated using each memory map 10', 10" to be evaluated and
the code size of each function. The physical PC addresses can then be
calculated by simply adding the offset to the base physical address of the function
given in the memory map under evaluation.
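
The translation back to physical addresses can be sketched as below, assuming that a candidate memory map is simply an ordering of functions placed contiguously from a base address. The base address, function sizes and the names layout and translate are hypothetical, and any alignment padding a real linker would insert is ignored.

```python
def layout(order, sizes, base=0x08000000):
    """Assign each function a base address by placing functions back to back."""
    bases = {}
    addr = base
    for name in order:
        bases[name] = addr
        addr += sizes[name]
    return bases


def translate(annotated_trace, bases):
    """Turn (function, offset) pairs into physical PC addresses for one memory map."""
    return [bases[name] + offset for name, offset in annotated_trace]


sizes = {"main": 0x100, "idct": 0x200, "exit": 0x40}   # hypothetical code sizes
bases = layout(["idct", "main", "exit"], sizes)         # one candidate ordering
print([hex(a) for a in translate([("main", 0x58), ("idct", 0x08)], bases)])
```

Because only additions are involved, a candidate map can be evaluated against a long trace without re-linking the program.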

The cache model 14 counts the total number of cache misses (Step S7) that
would be caused if the application were to be re-linked and run on the actual
hardware with the given memory map. The results are stored and compared with
results for subsequently evaluated memory maps so that the memory map giving
the least number of misses can be identified. That memory map is stored and
used to relink the program (S10).
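
A simple direct-mapped cache model of the kind described might look like the following sketch. The cache size and line size are assumed values, and count_misses is a hypothetical name; the model tracks only which memory block occupies each line, which is all that is needed to count misses.

```python
def count_misses(addresses, cache_size=4096, line_size=32):
    """Count instruction cache misses for a trace of physical PC addresses."""
    num_lines = cache_size // line_size
    resident = [None] * num_lines   # memory block currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_size
        idx = block % num_lines
        if resident[idx] != block:
            misses += 1
            resident[idx] = block   # load the block, evicting the previous one
    return misses


# Addresses 0 and 4096 fall in different blocks that share line 0.
print(count_misses([0, 4, 4096, 0]))  # 3
```

Running this model over the translated trace for each candidate memory map and keeping the map with the lowest count corresponds to the comparison described above.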

A very basic genetic algorithm is used to explore potential memory maps for the
one with the best performance. The user chooses the number of memory maps 10,
10', 10" in the set SET 1 to be explored on each iteration, and criteria for
terminating the search by the tool 6.

At the start, each of the memory maps in the set is randomised. Then the tool
iterates until the end criteria are met.

A single iteration consists of two stages: evaluating the performance of each
memory map in the set and creating a new set of memory maps for the next
iteration.

The memory maps are evaluated as described above, with the number of misses
being used as the measure of performance. The fewer the misses, the less time the
program would spend stalled on the hardware.

Once the memory maps in the set have been evaluated, the aim is to create new 
memory maps that reduce the number of misses. The best memory map found 
so far will always be kept, while the rest of the memory maps will be replaced with
new ones. The new ones are created using three techniques: 

• Random swap - Take the best memory map and perform a swap of two
random functions.

• Merging - If two or more memory maps on this iteration have improved on
the previous best then merge the changes of each.

• Target functions - Misses can be classified as either: Compulsory -
misses that would occur even in an infinite cache because the code has to
be loaded in before it is executed. Conflict - misses that would not have
occurred in a fully associative cache of the same size. Capacity - all other
misses, which are simply due to the size of the cache. Those that can be
eliminated are the conflict misses, which are usually caused by functions
clashing with each other. In order to eliminate these misses, functions that
are causing the most conflict misses are targeted for swapping.
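
The three-way classification of misses can be sketched by running a direct-mapped model alongside a fully associative model of the same size. This is an illustration only: the text does not specify the replacement policy of the fully associative reference, so least-recently-used replacement is assumed, and classify_misses is a hypothetical name. The input is a trace of memory block numbers.

```python
from collections import OrderedDict


def classify_misses(blocks, num_lines):
    """Classify direct-mapped misses as compulsory, conflict or capacity."""
    seen = set()            # blocks referenced at least once
    direct = {}             # line index -> resident block (direct-mapped model)
    fully = OrderedDict()   # fully associative model with assumed LRU replacement
    counts = {"compulsory": 0, "conflict": 0, "capacity": 0}
    for b in blocks:
        # Update the fully associative reference cache.
        fa_hit = b in fully
        if fa_hit:
            fully.move_to_end(b)
        else:
            if len(fully) >= num_lines:
                fully.popitem(last=False)
            fully[b] = True
        # Update the direct-mapped cache.
        idx = b % num_lines
        dm_hit = direct.get(idx) == b
        direct[idx] = b
        if not dm_hit:
            if b not in seen:
                counts["compulsory"] += 1   # first ever reference
            elif fa_hit:
                counts["conflict"] += 1     # would have hit if fully associative
            else:
                counts["capacity"] += 1     # cache is simply too small
        seen.add(b)
    return counts


print(classify_misses([0, 4, 0, 4], num_lines=4))
# {'compulsory': 2, 'conflict': 2, 'capacity': 0}
```

Attributing each conflict miss to the function containing the missing address, and to the function that evicted it, would give a per-function tally from which the most conflicting functions could be selected for swapping.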

The tool stops iterating once the user's end criteria have been met. This may be
after a number of iterations, or once a set number of misses has been reached, or
once the tool has failed to find a better memory map for a number of iterations.
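
The overall iteration and its end criteria can be sketched as follows. This is a simplified illustration rather than the tool itself: only the random swap technique is implemented (merging and targeted swaps are omitted), the evaluation function is passed in as a parameter, and all names and default values (search, random_swap, set_size, max_iters, target_misses, patience) are hypothetical.

```python
import random


def random_swap(order, rng):
    """Return a copy of `order` with two randomly chosen functions swapped."""
    new = list(order)
    i, j = rng.sample(range(len(new)), 2)
    new[i], new[j] = new[j], new[i]
    return new


def search(functions, evaluate, set_size=8, max_iters=100,
           target_misses=0, patience=20, seed=0):
    """Iterate over sets of memory maps, always keeping the best one found so far."""
    rng = random.Random(seed)
    # At the start, each memory map in the set is a random ordering.
    population = [rng.sample(functions, len(functions)) for _ in range(set_size)]
    best, best_misses = None, float("inf")
    stale = 0
    for _ in range(max_iters):                      # end criterion: iteration count
        improved = False
        for order in population:
            misses = evaluate(order)
            if misses < best_misses:
                best, best_misses = order, misses
                improved = True
        stale = 0 if improved else stale + 1
        if best_misses <= target_misses:            # end criterion: miss target reached
            break
        if stale >= patience:                       # end criterion: no recent improvement
            break
        # Keep the best map and replace the rest with mutated copies of it.
        population = [best] + [random_swap(best, rng) for _ in range(set_size - 1)]
    return best, best_misses


# Toy stand-in for the cache model: number of functions not in a preferred slot.
target = ["vld", "idct", "main", "exit"]
toy_evaluate = lambda order: sum(a != b for a, b in zip(order, target))
print(search(["main", "vld", "idct", "exit"], toy_evaluate))
```

In practice `evaluate` would translate the annotated trace under the candidate ordering and return the miss count from the cache model.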

On exit, the tool dumps the memory map of the optimal solution found so that the
real program can be linked using that memory map. It also reports the total
number of misses that should be produced by the memory map, and the number
of compulsory misses there are (due to total code size executed). The ratio of the
total misses to compulsory misses gives a good indication of the effectiveness of
the tool.

This software optimisation method is not guaranteed to work for all applications,
but there are many suitable applications where the optimisation method can be
used effectively, allowing direct-mapped caches to be used.

Essentially, optimising a program for the instruction cache will work well if the
program demonstrates repeatable execution flow. This is true of many streaming
data (audio/video) applications, where typical data sets can be used to determine
the execution flow of the application.



